(57) Abstract

System and method for improving the performance of learning agents such as neural networks, genetic
algorithms and decision trees that derive prediction methods from a training set of data. In part of
the method, a population of learning agents of different classes is trained on the data set, each
agent producing in response a prediction method based on the agent's input representation. Feature
combinations are extracted from the prediction methods produced by the learning agents. The input
representations of the learning agents are then modified by including therein a feature combination
extracted from another learning agent. In another part of the method, the parameter values of the
learning agents are changed to improve the accuracy of the prediction method. A fitness measure is
determined for each learning agent based on the prediction method the agent produces. Parameter
values of a learning agent are then selected based on the agent's fitness measure. Variation is
introduced into the selected parameter values, and another learning agent of the same class is
defined using the varied parameter values. The learning agents are then again trained on the data
set to cause a learning agent to produce a prediction method based on the derived feature
combinations and varied parameter values.

SYSTEM AND METHOD FOR COMBINING MULTIPLE LEARNING
AGENTS TO PRODUCE A PREDICTION METHOD

FIELD OF THE INVENTION 

This invention relates generally to artificial intelligence and machine learning. More
particularly, this invention relates to a system and method for combining learning agents that derive
prediction methods from a training set of data, such as neural networks, genetic algorithms and decision
trees, so as to produce a more accurate prediction method.

BACKGROUND OF THE INVENTION

Prior machine learning systems typically include a database of information known as a
training data set from which a learning agent derives a prediction method for solving a problem of
interest. This prediction method is then used to predict an event related to the information in the
training set. For example, the training set may consist of past information about the weather, and the
learning agent may derive a method of forecasting the weather based on the past information. The
training set consists of data examples which are defined by features, values for these features, and
results. Continuing the example of weather forecasting, an example may include features such as
barometric pressure, temperature and precipitation with their corresponding recorded values (mm of
mercury, degrees, rain or not). The resulting weather is also included in the example (e.g., it did or did
not rain the next day). The learning agent includes a learning method, a set of parameters for
implementing the method, and an input representation that determines how the training data's features
will be considered by the method. Typical learning methods include statistical/Bayesian inference,
decision-tree induction, neural networks, and genetic algorithms. Each learning method has a set of
parameters for which values are chosen for a particular implementation of the method, such as the
number of hidden nodes in a neural network. Similarly, each application of a learning method must
specify the representation, that is, the features to be considered; for example, the semantics of the
neural network's input nodes.

One clear lesson of machine learning research is that problem representation is crucial
to the success of all learning methods (see, e.g., Dietterich, T., "Limitations on Inductive Learning,"
Proceedings of Sixth International Workshop on Machine Learning (pp. 125-128), Ithaca, NY:
Morgan Kaufmann (1989); Rendell, L., & Cho, H., "Empirical Learning as a Function of Concept
Character," Machine Learning, 5(3), 267-298 (1990); Rendell, L., & Ragavan, H., "Improving the
Design of Induction Methods by Analyzing Algorithm Functionality and Data-based Concept
Complexity," Proceedings of IJCAI (pp. 952-958), Chambery, France (1993), which are all hereby
incorporated by reference). However, it is generally the case that the choice of problem representation
is a task done by a human experimenter, rather than by an automated machine learning system. Also
significant in the generalization performance of machine learning systems is the selection of the
learning method's parameter values, which is also a task generally accomplished by human "learning
engineers" rather than by automated systems themselves.

The effectiveness of input representations and that of free parameter values are mutually
dependent. For example, the appropriate number of hidden nodes for an artificial neural network
depends crucially on the number and semantics of the input nodes. Yet up to now, no effective method
has been developed for simultaneously searching the spaces of representations and parameter values of
a learning agent for the optimum choices, thereby improving the accuracy of its prediction method.

Machine learning systems have also conventionally used a single learning agent for 
producing a prediction method. Until now, no effective method has been developed that utilizes the 
interaction of multiple learning agents to produce a more accurate prediction method. 

An objective of the invention, therefore, is to provide a method and system that
optimizes the selections of parameter values and input representations for a learning agent. Another
objective of the invention is to provide a simple yet effective way of synergistically combining multiple
learning agents in an integrated and extensible framework to produce a more accurate prediction
method. With multiple and diverse learning agents sharing their output, the system is able to generate
and exploit synergies between the learning methods and achieve results that can be superior to any of
the individual methods acting alone.

SUMMARY OF THE INVENTION 

A method of producing a more accurate prediction method for a problem in
accordance with the invention comprises the following steps. Training data is provided that is related
to a problem for which a prediction method is sought, the training data initially represented as a set of
primitive features and their values. At least two learning agents are also provided, the agents including
input representations that use the primitive features of the training data. The method then trains the
learning agents on the data set, each agent producing in response to the data a prediction method based
on the agent's input representation. Feature combinations are extracted from the prediction methods
produced by the learning agents. The input representations of the learning agents are then modified by
including feature combinations extracted from another learning agent. The learning agents are again
trained on the augmented training data to cause a learning agent to produce another prediction method
based on the agent's modified input representation.

In another method of the invention, the parameter values of the learning agents are
changed to improve the accuracy of the prediction method. The method includes determining a fitness
measure for each learning agent based on the quality of the prediction method the agent produces.
Parameter values of a learning agent are then selected based on the agent's fitness measure. Variation
is introduced into the selected parameter values, and another learning agent is defined using the varied
parameter values. The learning agents are again trained on the data set to cause a learning agent to
produce a prediction method based on the varied parameter values.

The two methods may be used separately or in combination. Results indicate a
synergistic interaction of learning agents when both methods are combined, which provides yet a more
accurate prediction method.

The foregoing and other objects, features, and advantages of the invention will become
more apparent from the following detailed description of a preferred embodiment which proceeds with
reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS 

Fig. 1 is a block diagram of the software architecture of a machine learning system 
according to the invention. 

Fig. 2 is a flowchart of an overall method for producing a prediction method according
to the invention.

Fig. 3 is a flowchart of a preferred method of creating an initial agent population
according to the overall method of Fig. 2.

Fig. 4 is a flowchart of a preferred method of evaluating learning agents' prediction
methods according to the overall method.

Fig. 5 is a flowchart of a preferred method of creating a next population of learning
agents according to the overall method.

Fig. 6 is a flowchart of a preferred method of extracting feature combinations
according to the overall method.

Fig. 7 is a flowchart of a preferred method of selecting input representations according
to the overall method.

Fig. 8 is a pictorial diagram of a population of learning agents created in accordance
with the invention.

Fig. 9 is a pictorial diagram of the method of creating a next population of learning
agents in accordance with the invention.

Fig. 10 is a pictorial diagram of the method of selecting input representations in
accordance with the invention.

Fig. 11 is a graph of the fitness of prediction methods as a function of the generations
of learning agents executed.

Fig. 12 is a graph of the contributions of the inventive methods to the accuracy of
prediction methods as a function of the generations of learning agents executed.

Fig. 13 is a graph of the accuracy of prediction methods resulting from the synergistic
interaction of two learning agents as a function of the generations of learning agents executed.

DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT 

The invention can be implemented in any number of ways with a computer program
executing on a general purpose computer. The preferred embodiment includes a control program
written in Carnegie Mellon Common Lisp 17f and the PCL Common Lisp Object System, resident on a
computer-readable medium such as a hard or floppy disk, tape cartridge, etc., and running on a Silicon
Graphics Indigo2 computer workstation. The learning agents incorporated into this embodiment are all
publicly available and include the C4.5 decision tree induction system, the C4.5rules extraction tool
(Quinlan, 1991), the LFC++ constructive induction program (Rendell & Ragavan, 1993; Vilalta,
1993), and the conjugate gradient descent trained feedforward neural network (CG) from the UTS
neural network simulation package (van Camp, 1994). Each of these learning agents has a method
associated with an object class which defines how to execute the method, how to parse the results
returned, and what the list, or vector, of free parameters is for the agent. Of course, the invention is not
limited to the described computer or learning agents. Any computer system (single CPU,
multiprocessor, etc.) capable of executing the control program and learning agents can be used, and
any learning agents that produce prediction methods can be used.

Overview of the Machine Learning System

Fig. 1 is a block diagram of the software architecture of a machine learning system 20
according to the invention. A population of learning agents such as agents 24-28 is controlled by an
agent control program 30. Each learning agent includes a learning method, a set of parameters for
implementing the method, and a representation of the training data's features to be considered by the
method. Fig. 8 shows a typical agent population comprising four classes of learning agents:
statistical/Bayesian inference method-based agents 24a-c, decision-tree induction method-based agents
26a-d, neural network method-based agents 28a-c, and genetic algorithm method-based agents 29a-c.
Each learning agent includes a set of parameter values, such as the number of hidden nodes in a neural
network (five hidden in agent 28a, zero hidden in agent 28b). Similarly, each learning agent includes a
representation of the features considered by the learning method, such as the semantics of the neural
network's input nodes (F1, F2/F3, etc.). Control program 30 sets parameter values and executes the
learning method associated with each agent. The data input to each learning agent is derived from a
knowledge data base such as training data set 32 by transforming the original data set according to a set
of feature combinations specific to each agent. These specific transformation methods are represented
in Fig. 1 by transforms 35, 36, and 37. However, in the initial execution of system 20, all agents are
given the original, untransformed version of the training data.

Training data set 32 includes data examples which are defined by features, values for
these features, and results. If the training set relates to weather prediction, an example may include
features such as barometric pressure, temperature and precipitation with their corresponding recorded
values (mm of mercury, degrees, rain or not). The resulting weather is also included in the example
(e.g., it did or did not rain the next day).

Each of the learning agents 24-28 produces in response a prediction method based on
the agent's input representation. The prediction methods are examined by a prediction method
evaluator 38 that determines an unbiased estimate of the accuracy of the prediction method generated
by each agent. The prediction accuracies are used by control program 30 to select new parameter
settings for a next execution of the learning agents.

The prediction methods produced by the learning methods are also examined by a
compound feature constructor 39. Constructor 39 extracts the combinations of features in the agent's
input representation that are determined to be the most important in producing the agent's prediction
method. These extracted feature combinations are passed to the appropriate data transforms 35-37 for
re-representing the training data for the next execution of the learning agents. As will be described, a
feature combination can be passed to the same learning agent from whose prediction method it was
extracted, to a different agent of the same class, or to a learning agent of a different class.

Both the parameter values and the input representations for learning agents 24-28
change for each system execution, thereby creating a new generation of learning agents each time.
Execution of the learning agents continues until a prediction method that achieves a desired accuracy
(such as 100%) is found or until a predefined time limit has expired.

The Overall Method

A method for producing a prediction method according to the invention functions by
evolving the population of learning agents 24-28. The population is defined by a classification task $T$,
a set of classified examples of that task expressed in a primitive representation $E_p$, a fitness function
for agents $f_A$, and a set of learning agents $A$:

$$P = \{T, E_p, f_A, A\}$$

A fitness function for agents maps a member of the set $A$ to a real number between 0
and 1. For convenience, it is useful to define $P_L$ to be the subset of $P$ where all the agents use
learning method $L$ (see below).

Each learning agent $A_i$ is defined by a learning method $L$, a vector of parameter values
$v$, a fitness function for memes $f_m$, and an ordered set of problem representation transformations $R_i$:

$$A_i = \{L, v, f_m, R_i\}$$

The set, or vector, of parameter values may have different length for different learning
methods. Each member of the set of problem representation transformations is a mapping from an
example ($e_p \in E_p$) to a value which is a legal element of an input representation for $L$ (e.g., a real
number, nominal value, bit, Horn clause, etc.). The individual mappings are called memes. The
ordered set of feature combinations ($R_i$) is called the agent's memome, and the vector of parameter
values is called its genome. The fitness function for memes $f_m$ is a function that maps a member of the
set $R_i$ to a real number between 0 and 1.

A transformed example $e_i$ is the result of sequentially applying each of the problem
representation transformations $r \in R_i$ to an example $e_p \in E_p$. The application of the set of problem
representation transformations to the set of examples in primitive representation results in a set of
transformed examples, which is called $E_i$. When $E_p = E_i$, the transformation is said to be the identity.
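
As a minimal sketch of the transformation just described, the following Python fragment applies an ordered set of memes to one primitive example. The names `transform_example` and `identity_memes`, the dictionary layout of an example, and the F1 and F2 values are illustrative assumptions, not part of the described embodiment.

```python
from typing import Callable, Dict, List, Sequence

Example = Dict[str, float]         # a primitive example e_p: feature name -> value
Meme = Callable[[Example], float]  # one problem representation transformation r


def transform_example(example: Example, memes: Sequence[Meme]) -> List[float]:
    """Apply each transformation, in order, to one primitive example."""
    return [meme(example) for meme in memes]


def identity_memes(feature_names: Sequence[str]) -> List[Meme]:
    """Identity representation: one meme per primitive feature (hypothetical helper)."""
    return [lambda e, name=name: e[name] for name in feature_names]


# Hypothetical primitive example with two features.
primitive = {"F1": 2.0, "F2": 3.0}

# Memome = identity memes plus one compound feature, F1 times F2.
memome = identity_memes(["F1", "F2"]) + [lambda e: e["F1"] * e["F2"]]
transformed = transform_example(primitive, memome)
```

With only the identity memes the transformed example carries the same values as the primitive one, which corresponds to the identity case in the text.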

Given a population as defined above, the process of coevolution is shown in Fig. 2 and
in the pseudocode in Table 1.

Initialize the population with random legal parameter vectors and the identity
representation transformation.

* Determine the phenotype of each agent i by applying learning algorithm $L_i$ to
transformed examples $E_i$ using parameter values $v_i$ using K-way cross-validation. The phenotype is
an ordered set of: the output of the learning agent's cross-validation runs, the cross-validation
accuracies and the learning time.

Determine the fitness of each agent by applying $f_A$ to the phenotype of each agent.

Create new memes by parsing the output in each learning agent's phenotype and
extracting important feature combinations. The union of all memes generated by all members of the
population is called the meme pool.

Exchange memes. For each agent i, apply its meme fitness function $f_m$ to elements
of the meme pool. Select the agent's target number of memes (a free parameter) from the meme
pool with a probability proportional to the fitness of the memes.

Repeat from * meme-transfers-per-generation times.

Determine phenotype of the agents with their new memes.

Determine the fitness of the agents.

Create the next generation of agents by mutation, crossover and fitness proportional
reproduction. Agents whose genomes are mutated keep their memomes intact. For each pair of
agents whose genomes are crossed over to create a new pair of agents, the memomes of the parents
are arbitrarily assigned to the offspring.

Repeat from * until a member of the population can solve the problem.

Table 1
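
The control flow of Table 1 can be sketched as a Python skeleton. This is only one reading of the pseudocode: every callback (`evaluate`, `extract_memes`, `exchange_memes`, `next_generation`) is a hypothetical placeholder for the corresponding step, and the `max_generations` cap is an added assumption standing in for the time limit.

```python
from typing import Any, Callable, List, Tuple


def coevolve(population: List[Any],
             evaluate: Callable[[Any], Tuple[Any, float]],
             extract_memes: Callable[[Any], List[Any]],
             exchange_memes: Callable[[Any, List[Any]], Any],
             next_generation: Callable[[List[Any], List[float]], List[Any]],
             target_fitness: float,
             meme_transfers_per_generation: int = 1,
             max_generations: int = 100) -> Tuple[Any, float]:
    """Skeleton of the coevolution loop; returns the best agent and its fitness."""
    best_agent, best_fitness = None, float("-inf")
    for _ in range(max_generations):
        for _ in range(meme_transfers_per_generation):
            # Determine phenotype and fitness, then build the meme pool.
            results = [evaluate(agent) for agent in population]
            meme_pool = [m for phenotype, _ in results
                         for m in extract_memes(phenotype)]
            # Each agent picks memes from the pool for its next representation.
            population = [exchange_memes(agent, meme_pool) for agent in population]
        # Re-evaluate the agents with their new memes.
        fitnesses = [evaluate(agent)[1] for agent in population]
        top = max(range(len(population)), key=fitnesses.__getitem__)
        if fitnesses[top] > best_fitness:
            best_agent, best_fitness = population[top], fitnesses[top]
        if best_fitness >= target_fitness:
            break
        # Mutation, crossover and fitness proportional reproduction.
        population = next_generation(population, fitnesses)
    return best_agent, best_fitness


# Toy usage: "agents" are integers whose fitness is value / 10.
result = coevolve([1, 2],
                  evaluate=lambda a: (a, a / 10),
                  extract_memes=lambda phenotype: [],
                  exchange_memes=lambda agent, pool: agent,
                  next_generation=lambda pop, fit: [a + 1 for a in pop],
                  target_fitness=0.5)
```

Passing the steps in as callbacks keeps the loop independent of any particular learning method, which mirrors how the control program treats agents of different classes.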

As a first step 40, an initial population of learning agents is created. These agents are
typically though not necessarily of different classes, such as agents 24-28. Each of the learning agents
is then trained on the training data (step 42). This training comprises supplying the data to the learning
methods in a form that allows the methods to utilize their parameter values and input representations.
As a result of the training, each of the learning agents produces a prediction method. The prediction
methods are evaluated to determine their fitness (step 44). A decision based on the fitness measure (or
time expired) is then made to select a prediction method or to continue to evolve the learning agents
(step 46). If time has expired or the fitness measure of a prediction method is sufficient, then the
method ends. If neither of these events occurs, then the method proceeds to produce another set of
prediction methods.

The next step in the method is to extract feature combinations from the trained
learning agents for use in future input representations (step 50). With knowledge of these extracted
feature combinations, the control program transforms the features of the training data to reflect the
future input representations (step 52). For example, if it is determined that the mathematical combination
of F1 times F2 is important for the learning agents, then the training data is transformed to provide a
feature that has the value of F1 times F2.

Another step in the method is to select parameter values from the learning agents
based on the fitness of their prediction methods and create therefrom a next population of learning
agents (step 54). As will be described, this selection includes copying and varying parameter values
among learning agents in a class.

The method concludes with selecting input representations for the next population of
learning agents (step 56). This step is the process for evaluating the various feature combinations
extracted from the prediction methods and then matching them with the different learning agents in the
population.

Creating the Initial Agent Population

Fig. 3 is a flowchart of a preferred way of creating an initial agent population
according to the invention. To begin, each learning agent has a set of primitive features as its initial
input representation (60). There are at least two classes of learning agent, each with a list of
parameters and their legal ranges (62). Training data set 32 is also provided (64).

First, a learning agent is selected from a class, such as a neural network-based agent
28a (step 60). The agent's parameter values are then selected at random within their legal ranges (step
62). The agent's input representation is also set to an initial representation (step 64). A decision is
then made whether enough learning agents have been provided (step 66). If not, more agents of various
classes are created (steps 60-64). If so, the creating step is complete, and the method proceeds with
training the agent population (step 68).
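
A minimal sketch of steps 60-66 in Python, assuming each class is described by a dictionary of legal parameter ranges. The class names, parameter names and ranges below are hypothetical and chosen only for illustration.

```python
import random
from typing import Dict, List, Sequence, Tuple, Union

# A legal range is either numeric (low, high) bounds or a list of allowed values.
Range = Union[Tuple[float, float], List[str]]


def random_parameters(legal: Dict[str, Range], rng: random.Random) -> Dict[str, object]:
    """Pick a random legal value for every parameter of one agent class."""
    params: Dict[str, object] = {}
    for name, spec in legal.items():
        if isinstance(spec, tuple):
            low, high = spec
            if isinstance(low, int) and isinstance(high, int):
                params[name] = rng.randint(low, high)   # integer-valued parameter
            else:
                params[name] = rng.uniform(low, high)   # real-valued parameter
        else:
            params[name] = rng.choice(spec)             # nominal parameter
    return params


def initial_population(classes: Dict[str, Dict[str, Range]],
                       primitive_features: Sequence[str],
                       size: int,
                       rng: random.Random) -> List[dict]:
    """Create agents of various classes until enough have been provided."""
    class_names = list(classes)
    population = []
    while len(population) < size:
        cls = class_names[len(population) % len(class_names)]
        population.append({
            "class": cls,
            "params": random_parameters(classes[cls], rng),
            "representation": list(primitive_features),  # initial representation
        })
    return population


# Hypothetical classes and ranges.
agent_classes = {
    "neural_network": {"hidden_nodes": (0, 10)},
    "decision_tree": {"pruning_confidence": (0.05, 0.5),
                      "criterion": ["gain", "gain_ratio"]},
}
population = initial_population(agent_classes, ["F1", "F2"], 4, random.Random(0))
```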

Evaluating Agents' Prediction Methods

In addition to defining the fitness of features, system 20 must define the fitness of
learning agents themselves for the evolution of parameter values.

Fig. 4 is a flowchart of a preferred way of evaluating the agents' prediction methods.
While a number of techniques are known in the art for evaluating the accuracy of a prediction method,
the preferred way uses the following steps. First, cross validation is applied to the prediction method of
each learning agent to determine an unbiased estimate of the method's accuracy (as a percent of correct
predictions) (step 70). Cross validation is a well established statistical technique that is widely known in
the machine learning art. See, for example, its description in Computer Systems That Learn, by Weiss
and Kulikowski (1991). Additionally, the time required for a learning agent to produce its prediction
method is determined (step 72). The cross validation number and time are then combined to calculate
a fitness measure of the trained agent (step 74).

The fitness function for learning agents was selected to evolve learning agents that are
accurate, robust and fast:

where $C(A_i)$ is the cross-validation accuracy of the agent $A_i$, $S(A_i)$ is the standard deviation of that
accuracy, $t_{A_i}$ is the time it took for that agent to learn, $\bar{t}_A$ is the mean learning time for that class of
agent, $S(t_A)$ is the standard deviation of that mean, and $k$ is a constant that trades off the value of
accuracy versus that of learning time. Execution time is measured as the number of standard deviations
from the mean learning time for that class of learning agents so that classes of learning agents with
different training times can coexist in the same population. In the inventive system, $k$ was selected to
be 3 and accuracies and standard deviations are measured in percentage points. So, a learning agent
that had a mean accuracy of 80%, a standard deviation of 9% and took 1 standard deviation less than
the mean training time for that class of learning agent would get a fitness score of
$80 - \sqrt{9} - (-1) = 78$.
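
The display equation is not legible in this copy, so the following Python sketch reproduces only the arithmetic of the worked example above (accuracy, minus the square root of its standard deviation, minus the learning time expressed in standard deviations from the class mean). How the constant k enters the formula cannot be recovered from the text and is deliberately left out; the function and argument names are hypothetical.

```python
import math


def agent_fitness(accuracy_pct: float,
                  accuracy_std_pct: float,
                  train_time: float,
                  class_mean_time: float,
                  class_std_time: float) -> float:
    """Fitness as computed in the worked example (k omitted, see lead-in)."""
    # Learning time as standard deviations from the mean for this agent class.
    time_z = (train_time - class_mean_time) / class_std_time
    return accuracy_pct - math.sqrt(accuracy_std_pct) - time_z


# 80% accuracy, 9% standard deviation, one standard deviation faster than the mean.
score = agent_fitness(80.0, 9.0, train_time=40.0,
                      class_mean_time=50.0, class_std_time=10.0)
```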


Extracting Feature Combinations (Memes)

The main difference between coevolution learning and other evolutionary methods
derives from the creation and exchange of memes. The meme creation process takes the output of
learning agents that have been applied to a particular problem, and identifies combinations of the input
features that the learning agent determined were relevant in making the desired distinction. The
process of parsing output and creating new memes is specific to each learning method. For example, a
program that learns rules from examples might create new memes from the left hand sides of each of
the induced rules. Or, a program that learned weights in a neural network might create new memes
that were the weighted sum of the inputs to each of its hidden nodes (perhaps thresholded to remove
marginal contributions). Specific meme generation methods are discussed in more detail in the
implementation section, below.

Fig. 5 is a flowchart of a preferred way of extracting feature combinations from trained
learning agents. Different classes of agents require different extraction techniques, and the extraction
technique selected is based on the agent class (step 80). For genetic algorithm-based agents 29, the
algorithms with the highest accuracy are selected (step 82) and are converted to rules (step 84). For
statistical/Bayesian inference method-based agents 24a-c, the statistics with the highest correlation are
selected (step 86) and converted to rules (step 88). For neural network method-based agents 28, the
rules are extracted from the network (step 90) using known techniques such as those described by Craven
and Shavlik in "Extracting Tree-structured Representations of Trained Networks," Proceedings of the
Conference on Neural Information Processing Systems (1996) or by Towell and Shavlik in "The
Extraction of Refined Rules from Knowledge-based Neural Networks," Machine Learning. For
decision-tree induction method-based agents 26, the rules are extracted from the tree (step 92) as done,
for example, by the program C4.5rules described above and incorporated herein by reference. The left
hand sides of the rules, from whatever agent, are then used to define the feature combinations.
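
As a minimal illustration of the last sentence, the sketch below collects the left hand sides of a set of rules as feature combinations. The rule layout (a set of conditions paired with a predicted class) and the example rules are assumptions for illustration and do not reflect the actual output format of C4.5rules or any other tool.

```python
from typing import FrozenSet, List, Set, Tuple

# A rule is (left-hand-side conditions, predicted class); hypothetical layout.
Rule = Tuple[FrozenSet[str], str]


def memes_from_rules(rules: List[Rule]) -> Set[FrozenSet[str]]:
    """Use each non-empty left hand side as one feature combination."""
    return {lhs for lhs, _ in rules if lhs}


# Hypothetical rules, as might be extracted from any class of agent.
rules = [
    (frozenset({"A", "not B"}), "rain"),
    (frozenset({"C"}), "no rain"),
    (frozenset({"A", "not B"}), "rain"),   # duplicate left hand side
]
memes = memes_from_rules(rules)
```

Storing each left hand side as a frozen set makes identical combinations from different rules or agents collapse into one meme.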

Creating a Next Population of Learning Agents

The creation of the next generation of agents is a minor variant on traditional genetic
algorithms. Instead of having the genome be "bits" it is a vector of parameter values. The crossover,
mutation and fitness proportional reproduction functions are all identical with the related operations on
bit vectors in genetic algorithms. It is, of course, possible to transform parameter vectors into bitstring
representations themselves, if it were desirable. In addition to using parameter vectors instead of bits,
the other difference is that each subpopulation using a particular learning method (the $P_L$'s) has its own
type of parameter vector, since the free parameters and their legal values vary among learning methods.
These parameter vectors may be of different sizes, so crossover can only be applied with members of
the same subpopulation.

Fig. 6 is a flowchart of a preferred method for creating a next population, or
generation, of learning agents in accordance with the invention. The steps of this method are best
understood with reference to Fig. 9, which shows how they apply to change the parameter values of the
three neural network-based agents 28a-c. From the evaluation of the agents' prediction methods (Fig.
4), fitness measures for each of the agents have been calculated. These are shown in Fig. 9 as decimal
fractions such as .67574 for agent 28a, .6823 for agent 28b, and .1867 for agent 28c. Parameter values
for each agent in the next generation are then set (step 100). These are set preferably by randomly
selecting a trained agent within the class (e.g., neural network-based agents) and copying its parameter
values, with the probability of selection proportional to the fitness measure of the trained agent. For
example, in Fig. 9, because of the relative fitness measures agent 28a is selected once for the next
generation of agents (agent 28a'), agent 28b is selected twice (agents 28b' and 28b''), and agent 28c is
not selected.
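
Fitness proportional selection within one class can be sketched with the standard library's `random.choices`, which draws with replacement using the given weights. The function name `select_parents` and the use of agent labels as stand-ins for parameter vectors are assumptions for illustration.

```python
import random
from typing import List, Sequence


def select_parents(agents: Sequence[str],
                   fitnesses: Sequence[float],
                   rng: random.Random) -> List[str]:
    """Draw one parent per slot in the next generation, with replacement,
    with probability proportional to each trained agent's fitness."""
    return rng.choices(list(agents), weights=list(fitnesses), k=len(agents))


# Fitness values quoted in the text for agents 28a, 28b and 28c.
parents = select_parents(["28a", "28b", "28c"],
                         [0.67574, 0.6823, 0.1867],
                         random.Random(0))
```

Because the draw is random, a low-fitness agent such as 28c may still be picked occasionally; the outcome shown in Fig. 9 is one possible result.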

In a next step 102, a subset of the new agents is then randomly selected. In Fig. 9, this
subset is agent 28b'. The subset, of course, could be more than one new agent. For each agent of the
subset, a parameter is randomly selected, and its value is randomly set (step 104). In the example of
Fig. 9, the parameter indicating the number of hidden nodes in the neural network is selected and its
value is changed from 3 to 1. This creates a mutated, final agent version 28f that differs from the
intermediate form 28b' by the number of hidden nodes.

One or more pairs of agents are then selected from the intermediate population (step
106) and parameter values are crossed over, or swapped (step 108). In the example of Fig. 9, agents
28a' and 28b' are the pair of selected agents, and the parameter values for back propagation and
lambda are swapped to form two new agents 28d and 28e.
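
Mutation (step 104) and crossover (step 108) on parameter vectors can be sketched as follows. The parameter names and legal values are hypothetical, and representing a parameter vector as a dictionary is an assumption made for readability.

```python
import random
from typing import Dict, List, Sequence, Tuple

Params = Dict[str, object]


def mutate(params: Params, legal: Dict[str, List[object]],
           rng: random.Random) -> Params:
    """Randomly select one parameter and randomly set it to a legal value."""
    child = dict(params)
    name = rng.choice(sorted(child))
    child[name] = rng.choice(legal[name])
    return child


def crossover(a: Params, b: Params, names: Sequence[str]) -> Tuple[Params, Params]:
    """Swap the named parameter values between two agents of the same class."""
    child_a, child_b = dict(a), dict(b)
    for name in names:
        child_a[name], child_b[name] = b[name], a[name]
    return child_a, child_b


# Hypothetical legal values and parents for a neural network class.
legal = {"hidden_nodes": [0, 1, 3, 5],
         "back_propagation": [0.1, 0.5],
         "lambda": [0.01, 0.1]}
parent_a = {"hidden_nodes": 5, "back_propagation": 0.1, "lambda": 0.01}
parent_b = {"hidden_nodes": 3, "back_propagation": 0.5, "lambda": 0.1}

child_d, child_e = crossover(parent_a, parent_b, ["back_propagation", "lambda"])
mutant = mutate(parent_b, legal, random.Random(1))
```

Both operations return new dictionaries and leave the parents untouched, so the intermediate population remains available while the next one is assembled.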

The next population of learning agents is then ready for training on the training data.
In Fig. 9, this next population comprises agents 28d-f in the neural network class. Other classes of
learning agents would similarly have new generations.

Selecting Input Representations for Learning Agents

Memes (feature combinations) must be assigned fitnesses by agents. It is possible for
each agent to have its own method of determining meme fitness. However, a simplified method is used
in the preferred embodiment of the invention. Whenever a program generates a new meme, its fitness
is defined to be the average "importance" of the feature combination, weighted by the accuracy of the
learning agents that used the meme. In Durham's terms, the inventor of the new meme "imposes" its
belief in its value on others. In addition, receivers of memes have a parameter which determines how
much weight they put on memes they generate themselves versus memes generated by others. This
mechanism could be straightforwardly generalized to weight the fitnesses of memes generated by
different classes of agents differently. Importance is defined differently for different learning methods.
For C4.5 and LFC++, the importance of a feature combination is the ratio of correctly classified
examples that triggered the rule containing the feature to the total number of occurrences of the feature.
So, for example, if "A and not B" is a feature extracted from a single C4.5 learning agent that had an
80% cross-validation accuracy, and 9 of the 10 examples the feature appeared in were classified
correctly by the rule containing the feature, its importance would be 9/10 and its fitness would be 0.9 *
0.8 = 0.72. For UTS, the importance of a feature is the absolute value of the ratio of the weight from
the hidden node that defined the feature to the threshold of the output unit. If there is more than one
output unit, it is the maximum ratio for any of the output units. Features that play a significant role in
learning agents that are accurate have higher fitness than features that do not play as significant a role
or that appear in learning agents that are less accurate. It is possible for a feature to have relatively low
prevalence in the population and yet still be important if it tends to generate correct answers when it is
used.
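
The worked example for a rule-based agent can be checked with a short Python sketch. Reading "average importance weighted by accuracy" as the mean of importance times accuracy over the agents that used the meme is an assumption; it is chosen because it reproduces the single-agent figure of 0.72 given above. The function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def rule_importance(correct: int, occurrences: int) -> float:
    """Importance for rule-based agents: correctly classified examples that
    triggered the rule, divided by all occurrences of the feature."""
    return correct / occurrences


def meme_fitness(uses: Sequence[Tuple[float, float]]) -> float:
    """Mean of importance * accuracy over the agents that used the meme
    (one possible reading of the text, see lead-in)."""
    return sum(importance * accuracy for importance, accuracy in uses) / len(uses)


# "A and not B": 9 of 10 occurrences correct, from one agent with 80% accuracy.
importance = rule_importance(9, 10)
fitness = meme_fitness([(importance, 0.8)])
```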

Fig. 7 is a flowchart that illustrates the process for selecting input representations for
learning agents. First the fitness for each feature (i.e., feature combination) is calculated (step 110) in
the manner described above. Then for each agent, the number of features specified by its "feature
count" parameter is selected (step 112). A decision is then made whether to select a feature generated
by an agent of the same class (step 114). If so, the feature is selected based on its fitness (step 116). If
not, a feature from another class of agents is selected, again based on its fitness (step 118). A further
decision is then made whether the learning agent has selected enough features for its input
representation (step 120). If not, then steps 114-120 are repeated until enough features have been
selected (step 122).
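
A minimal sketch of the loop in steps 112 through 122 follows. It rests on several assumptions that the text does not state: that "based on its fitness" means fitness-proportional (roulette wheel) selection, that the same-class decision of step 114 is a random draw against the agent's internal meme preference, that a feature is not selected twice, and that all fitness values are positive. All names are hypothetical.

```python
import random


def select_features(pool, agent_class, feature_count, internal_preference, rng=None):
    """Pick features for one agent's input representation.

    pool: list of (feature, source_class, fitness) tuples, fitness > 0 (assumed).
    """
    rng = rng or random.Random()
    candidates = list(pool)
    chosen = []
    while len(chosen) < feature_count and candidates:          # step 120
        want_own = rng.random() < internal_preference          # step 114 (assumed rule)
        group = [c for c in candidates if (c[1] == agent_class) == want_own]
        if not group:
            group = candidates       # preferred group exhausted: use what is left
        weights = [c[2] for c in group]
        pick = rng.choices(group, weights=weights, k=1)[0]      # steps 116 / 118
        chosen.append(pick[0])
        candidates.remove(pick)      # assumed: no feature is chosen twice
    return chosen                                               # step 122


pool = [("a", "CG", 0.7), ("b", "CG", 0.2), ("c", "LFC", 0.9), ("d", "LFC", 0.4)]
own_only = select_features(pool, "CG", 2, 1.0, random.Random(0))
```

With an internal preference of 1.0 the agent only draws from its own class while such features remain, so `own_only` contains "a" and "b" in some order.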

The process of Fig. 7 is shown graphically in Fig. 10. Learning agents such as agents
28a, 26c, 29a, and 24b each produce a prediction method from which feature combinations are
extracted by constructor 39 (Fig. 1) and given a fitness measure before being collected in a feature
combination pool 122. From this pool feature combinations are then selected for the next generation
of learning agents such as 28d, 26f, 29e, and 24r.

A Specific Implementation

An implementation of the invention includes the C4.5 decision
tree induction system and C4.5rules rule extraction tool, the LFC++ constructive induction program
and the conjugate gradient descent trained feedforward neural network (CG) from the UTS neural
network simulation package. Most of the documented parameters for each class of learning agent are
included in the free parameter vector for that system, along with a specification of either an upper and
lower bound for the parameter's values or a list of possible parameter values. For example, the
parameter vector for the UTS conjugate gradient descent learning agent includes the number of hidden
nodes in the network, the maximum number of iterations before halting, the output tolerance
(specifying how close to 1 or 0 an output has to be to count as true or false), a flag for whether or not to
use competitive learning on the output, two nominal parameters specifying the kind of line search and
direction finding method to use, and three parameters specifying the maximum number of function
evaluations per iteration, the minimum function reduction necessary to continue the search and the
maximum slope ratio to continue the search. These parameters are explained more fully in the
documentation available with the source code.
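
One way to picture such a free parameter vector is as a mapping from parameter names to either a bounded range or a list of allowed values. The sketch below is illustrative only: the parameter names, the bounds and the nominal values are all invented placeholders, not the settings of the UTS package.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Bounded:
    """A parameter specified by lower and upper bounds."""
    low: float
    high: float
    integer: bool = False

    def sample(self, rng):
        if self.integer:
            return rng.randint(int(self.low), int(self.high))
        return rng.uniform(self.low, self.high)


@dataclass(frozen=True)
class Nominal:
    """A parameter specified by a list of possible values."""
    choices: tuple

    def sample(self, rng):
        return rng.choice(self.choices)


# Hypothetical specification mirroring the nine CG parameters named in the text.
CG_PARAMETER_SPEC = {
    "hidden_nodes": Bounded(1, 20, integer=True),
    "max_iterations": Bounded(10, 1000, integer=True),
    "output_tolerance": Bounded(0.0, 0.5),
    "competitive_output": Nominal((False, True)),
    "line_search": Nominal(("search_a", "search_b")),
    "direction_method": Nominal(("direction_a", "direction_b")),
    "max_function_evaluations": Bounded(1, 100, integer=True),
    "min_function_reduction": Bounded(0.0, 1.0),
    "max_slope_ratio": Bounded(0.0, 1.0),
}


def random_parameter_vector(spec, rng):
    """Draw one value for every free parameter in the specification."""
    return {name: parameter.sample(rng) for name, parameter in spec.items()}


vector = random_parameter_vector(CG_PARAMETER_SPEC, random.Random(1))
```

Keeping the bounds next to each parameter means that any later variation of the values can be checked against the same specification.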

Each learning agent has a method for extracting new features from its output. LFC++,
like other constructive induction programs, specifically defines new features as part of its output. For
C4.5, the output of the C4.5rules tool was parsed so that each left hand side of each rule was reified
into a new feature definition. For the neural network learning agent, a new feature was created for each
hidden node defined by the weighted sum of the inputs to that node, with any input whose contribution
to that sum was less than the threshold of the node divided by the number of inputs removed from the
sum. These extracted features are added to the feature (i.e., meme) pool.
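
The hidden-node rule can be sketched as follows. The sketch assumes that an input's "contribution" is the magnitude of its weight (which holds for 0/1 inputs that are switched on) and that the cutoff is the magnitude of the node threshold divided by the number of inputs; both are interpretations, and the function names are hypothetical.

```python
def extract_hidden_node_feature(weights, threshold):
    """Return the weighted-sum terms kept for one hidden node.

    weights: dict mapping input name -> weight into the hidden node.
    Inputs whose contribution falls below threshold / number_of_inputs are dropped.
    """
    if not weights:
        return {}
    cutoff = abs(threshold) / len(weights)
    return {name: w for name, w in weights.items() if abs(w) >= cutoff}


def evaluate_feature(terms, example):
    """Value of the extracted feature on one example (dict of input name -> value)."""
    return sum(w * example[name] for name, w in terms.items())


node_weights = {"b1": 2.0, "b2": 0.1, "b3": -1.5}
feature_terms = extract_hidden_node_feature(node_weights, threshold=1.5)
feature_value = evaluate_feature(feature_terms, {"b1": 1, "b2": 1, "b3": 1})
```

Here the cutoff is 1.5 / 3 = 0.5, so the weak input "b2" is removed and the new feature is the weighted sum over "b1" and "b3" only.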

Several other aspects of this implementation must be specified in order for it to work
properly. Meme definitions must be stored in a canonical form so that it is possible to detect when two
or more generated memes are in effect identical and should have their fitness scores combined. Memes
are represented as boolean combinations of primitives or mathematical formulas over primitives, which
can be straightforwardly compared, although the extension to first order predicate calculus would make
this a more difficult problem. The number of memes that an agent uses (i.e., the dimensionality of its
input representation) is treated as a free parameter of the agent, and allowed to vary from two to twice
the number of primitives. In addition, agents are allowed to prefer memes of their own creation to
memes created by other learning agents. A single parameter, ranging over [0,1], specifies the internal
meme preference of each agent.
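
For boolean memes over a small number of primitives, one simple canonical form is the meme's full truth table. The sketch below uses that technique as a stand-in; it is not claimed to be the canonical form used in the implementation, and summing the fitness scores of identical memes is likewise an assumption about what "combined" means. All names are hypothetical.

```python
from itertools import product


def canonical_key(meme, primitives):
    """Truth table of a boolean meme, usable as a dictionary key.

    meme: callable taking a dict of primitive name -> bool.
    """
    names = sorted(primitives)
    return tuple(
        bool(meme(dict(zip(names, values))))
        for values in product((False, True), repeat=len(names))
    )


def merge_memes(scored_memes, primitives):
    """Collapse logically identical memes, adding their fitness scores (assumed)."""
    pool = {}
    for meme, fitness in scored_memes:
        key = canonical_key(meme, primitives)
        if key in pool:
            kept, total = pool[key]
            pool[key] = (kept, total + fitness)
        else:
            pool[key] = (meme, fitness)
    return list(pool.values())


a_and_not_b = lambda v: v["A"] and not v["B"]
same_meme = lambda v: not (v["B"] or not v["A"])   # De Morgan form of the same meme
a_or_b = lambda v: v["A"] or v["B"]

merged = merge_memes([(a_and_not_b, 0.72), (same_meme, 0.10), (a_or_b, 0.50)], ["A", "B"])
```

The truth table grows as 2^n in the number of primitives, so this is only practical for small problems such as the nine-bit example discussed in this description.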

Simulation Results on a Simple Test Problem

The implementation of system 20 described above was applied to an artificial problem
designed to be moderately difficult for the machine learning programs included in the system. The
problem was to identify whether any three consecutive bits in a nine bit string were on, e.g., 011011011
is false, and 001110011 is true. This problem is analogous both to detecting a win in tic-tac-toe and to
solving parity problems, which are known to be difficult for many types of learning systems. There are
512 possible examples, and the problem is simple enough so that the machine learning systems used
(even the neural network) run reasonably quickly. This is important, since each program is executed a
large number of times (see the computational complexity section, below).
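
The target concept is easy to state in code. The following sketch enumerates all 512 examples and labels them; the function names are illustrative and are not part of the implementation described here.

```python
from itertools import product


def parse_bits(text):
    """Turn a string such as '011011011' into a tuple of 0/1 integers."""
    return tuple(int(ch) for ch in text)


def three_consecutive_on(bits):
    """True if any three adjacent bits are all on."""
    return any(bits[i] and bits[i + 1] and bits[i + 2] for i in range(len(bits) - 2))


def all_examples(length=9):
    """Every bit string of the given length, paired with its label."""
    return [(bits, three_consecutive_on(bits)) for bits in product((0, 1), repeat=length)]


examples = all_examples()
```

The two strings quoted in the text label as expected: `three_consecutive_on(parse_bits("011011011"))` is false and `three_consecutive_on(parse_bits("001110011"))` is true.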

This problem was run on a system that used only LFC++ and the CG learning agents.
The population consisted of 10 LFC++ learning agents and 10 CG learning agents, and was run for 10
generations. Six fold cross-validation was used, and there was one meme exchange per generation.
This rather modest problem therefore required 2400 learning agent executions, taking more than four
days of R4400 CPU time. (The CPU time use was dominated by the 1200 neural network training
runs.) Despite the large amount of computation required for this simple problem, it is reasonable to
hope that the amount of computation required will grow slowly with more difficult problems (see the
computational complexity section, below).
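
The run counts quoted above can be reproduced with simple arithmetic if each agent is assumed to be trained twice per generation, once before and once after the meme exchange. That factor of two is an inference made here to reconcile the figures; the text does not state it directly.

```python
def learning_agent_executions(agents, generations, folds, trainings_per_generation):
    """Total training runs: one run per agent, fold and training pass in each generation."""
    return agents * generations * folds * trainings_per_generation


# Assumed: two training passes per generation (before and after the meme exchange).
total_runs = learning_agent_executions(agents=20, generations=10, folds=6,
                                       trainings_per_generation=2)
neural_network_runs = learning_agent_executions(agents=10, generations=10, folds=6,
                                                trainings_per_generation=2)
```

Under that assumption `total_runs` is 2400 and `neural_network_runs` is 1200, matching the figures in the text.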

This simulation illustrated three significant points. First, the implementation of
system 20 was able to consistently improve as it evolved. The improvement was apparent in both the
maximum and the average performance in the population, and in both the cross-validation accuracies
of the learning agents and in their fitnesses (i.e., accuracy, speed and robustness). Figure 11 shows
these results. In ten generations, the average fitness of the population climbed from 51% to 75%,
and the standard deviation fell from 21% to 12.5%. Looking at the best individual in the population
shows that the maximum fitness climbed from 84% to 94%, and the maximum cross-validation
accuracy climbed from 82.3% to 95%. Average execution time fell by more than 25% for both
classes of learning agents (it fell somewhat more sharply, and in much greater absolute magnitude, for
the neural network learning agents).

Figure 12 illustrates the separate contributions of the genetic evolution component and
the memetic evolution component compared to coevolution in the performance of the best individual
learning agent in the population. Consider first the purely genetic evolution of the free parameter
vectors. When there is no memetic component, there is no interaction between the different classes of
learning agents, since crossover and other sources of genetic variation apply only to a specific type of
learning agent. Using genetic search to find the best free parameter values for a neural network is
known to be slow (Yao, 1993). In this genetic-only neural network simulation, only a small
improvement was found in the fifth generation. Similarly for the genetic evolution of LFC++'s free
parameters, a modest difference was found in the sixth generation. Memetic evolution alone (without
changing any of the free parameter values) showed a more significant effect. Because both the neural
network and the constructive induction program make contributions to memetic evolution, any
synergies between them will appear in the memetic-only curve (see discussion of figure 4, below).
However, the steady rise of the coevolution curve, compared to the memetic-evolution only curve,
suggests that some adjustment of the free parameter values to match the representation changes
induced by memetic transfer may have a positive effect. At the end of 10 generations, the maximum
accuracy of the coevolved population is more than 5 percentage points higher than the memetic only
population. However, it is worth noting that the memetic only population equaled that performance
figure earlier in the simulation, and then drifted away.

Figure 13 illustrates the synergy that coevolution finds between the CG agent
population and the LFC++ agent population. When run with only a single type of learning agent, the
memetic evolution part of the inventive system becomes a kind of constructive induction program,
where the output of the learning agent is fed back as an input feature in the next iteration. Since
LFC++ is already a constructive induction program, it is somewhat surprising to note that LFC++
agents coevolving with each other still manage a small amount of improvement. Perhaps this is due to
the fact that the appropriate features for this problem have three conjuncts, and LFC++ under most free
parameter values tends to construct new features from pairs of input features. CG alone does not do
very well on this problem. Even when coevolved with itself, its most accurate agent improves only
from about 60% to a bit over 70% in ten generations. However, when these two learning methods are
combined in a single coevolution learning agent, the results are clearly better than either of the methods
used alone. The gap between the best performance of the pair of methods and the best performance of
a single method tends to be about 5 percentage points in this simulation, and the difference tends to be
larger as the population evolves.

In view of the many possible embodiments to which the principles of the invention
may be applied, it should be recognized that the illustrated embodiment is only a preferred example of
the invention and should not be taken as a limitation on its scope. The scope of the invention, rather, is
defined by the following claims. I therefore claim as my invention all that comes within the scope and
spirit of these claims.

I claim:

1. A computer-implemented method of producing a prediction method for a
problem, the method comprising the following steps:

providing training data related to a problem for which a prediction method is sought,
the training data represented by features with values;

providing at least two learning agents, the agents including input representations;

training the learning agents on the data set, each agent producing in response to the
data a prediction method based on the agent's input representation;

extracting feature combinations from the prediction methods produced by the learning
agents;

modifying the input representation of a learning agent by including in the input
representation a feature combination extracted from another learning agent; and

again training the learning agents on the data set to cause a learning agent to produce a
prediction method based on the agent's modified input representation.

2. The method of claim 1 wherein the extracting step comprises:

analyzing a prediction method to identify which features of an agent's input
representation are more important and less important to the prediction method; and

combining at least two of the more important features to produce a feature
combination.

3. The method of claim 1 including:

collecting feature combinations extracted from the prediction methods; and

determining a fitness measure for each of the collected feature combinations,

wherein the modifying step includes selecting a feature combination for an agent's
input representation based on the feature combination's fitness measure.

4. The method of claim 3 wherein the step of determining a fitness measure for
a feature combination comprises:

globally assigning a fitness value to each feature combination;

combining the combination's fitness value with a preference value to produce the
fitness measure.

5. The method of claim 1 wherein the learning agents are of two different types.

6. The method of claim 1 wherein the learning agents are of the same type but
have different parameter values.

7. The method of claim 1 wherein the learning agents have a same list of
parameters but different parameter values, the method including:

determining a fitness measure for each learning agent based on the prediction method
the agent produces;

selecting the parameter values of a learning agent based on the agent's fitness
measure;

introducing variation into the selected parameter values; and

defining another learning agent using the varied parameter values.

8. The method of claim 7 wherein the step of determining a fitness measure
comprises determining an accuracy of the prediction method and a completion time for the learning

9. A computer-readable medium on which is stored a computer program
comprising instructions which, when executed, perform the steps of claim 1.

10. A computer-implemented method of producing a prediction method, the
method comprising the following steps:

providing training data related to a problem for which a prediction method is sought,
the training data represented by features with values;

providing at least two learning agents having a same parameter list but different
parameter values;

training the learning agents on the data set, each agent producing in response to the
data a prediction method based on the agent's parameter values;

determining a fitness measure for each learning agent based on the prediction method
the agent produces;

selecting the parameter values of a learning agent based on the agent's fitness
measure;

introducing variation into the selected parameter values;

defining another learning agent using the varied parameter values; and

again training the learning agents on the data set to cause a learning agent to produce a
prediction method based on the varied parameter values.

11. A computer-implemented method of producing a prediction method, the
method comprising the following steps:

providing training data related to a problem for which a prediction method is sought,
the training data represented by features with values;

providing a plurality of learning agents of which at least two have a same parameter
list but different parameter values, the learning agents including input representations;

training the learning agents on the data set, each agent producing in response to the
data a prediction method based on the agent's parameter values and input representation;

determining a fitness measure for each learning agent based on the prediction method
the agent produces;

selecting the parameter values of a learning agent based on the agent's fitness
measure;

introducing variation into the selected parameter values;

defining another learning agent using the varied parameter values;

extracting feature combinations from the prediction methods produced by the learning
agents;

modifying the input representation of a learning agent by including in the input
representation a feature combination extracted from another learning agent; and

again training the learning agents on the data set to cause a learning agent to produce a
prediction method based on the varied parameter set and the modified input representation.

12. A computer-implemented system for producing a prediction method for a
problem, comprising:

training data related to a problem for which a prediction method is sought, the training
data represented by features with values;

at least two learning agents, the agents including input representations; and

at least one computer programmed for:

training the learning agents on the data set, each agent producing in response to the
data a prediction method based on the agent's input representation;

extracting feature combinations from the prediction methods produced by the learning
agents;

modifying the input representation of a learning agent by including in the input
representation a feature combination extracted from another learning agent; and

again training the learning agents on the data set to cause a learning agent to produce a
prediction method based on the agent's modified input representation.

13. The system of claim 12 wherein at least two learning agents have a same parameter
list but different parameter values, and the computer is further programmed for:

training the learning agents on the data set, each agent producing in response to the
data a prediction method based on the agent's parameter values;

determining a fitness measure for each learning agent based on the prediction method
the agent produces;

selecting the parameter values of a learning agent based on the agent's fitness
measure;

introducing variation into the selected parameter values;

defining another learning agent using the varied parameter values; and

again training the learning agents on the data set to cause a learning agent to produce a
prediction method based on the varied parameter values.